Implementing Self-Consistency Prompting with Batch Inference on Amazon Bedrock

Generative language models have demonstrated impressive capabilities in addressing logical and analytical natural language processing (NLP) challenges, and careful prompt engineering can further elevate their performance. For instance, the chain-of-thought (CoT) methodology has been recognized for enhancing a model’s ability to tackle complex multi-step problems. To further improve accuracy on tasks that demand reasoning, self-consistency prompting has been proposed, which replaces greedy decoding with stochastic sampling during language generation.

Amazon Bedrock is a fully managed service that provides access to high-performing foundation models from leading AI companies, including Amazon, through a single API. It offers a comprehensive suite of capabilities for developing generative AI applications with an emphasis on security, privacy, and responsible AI. The batch inference API enables users to run inference with foundation models in bulk, making it efficient to collect responses for large numbers of prompts. This article illustrates how to implement self-consistency prompting through batch inference on Amazon Bedrock, focusing on enhancing model performance in arithmetic problem-solving and multiple-choice reasoning tasks.

Overview of the Solution

Self-consistency prompting leverages the generation of multiple responses, which are then aggregated to form a final answer. Unlike single-generation methods like CoT, the self-consistency sample-and-marginalize technique produces a variety of model completions that lead to a more reliable solution. This diversity in responses is achievable through a stochastic decoding strategy, as opposed to a greedy approach.

The following illustration outlines how self-consistency diverges from the greedy CoT method: it generates a wider array of reasoning pathways and combines them to yield the final answer.

Decoding Strategies for Text Generation

Text generated by decoder-only language models is produced token by token, with each subsequent token being predicted based on the preceding context. The model calculates a probability distribution for each token’s likelihood of appearing next in the sequence. The decoding process converts these distributions into actual text and is guided by inference parameters, which are hyperparameters of the decoding method itself. One such parameter is the temperature, which rescales the probability distribution of the next token and thereby controls the randomness of the output.

Greedy decoding is a deterministic strategy that selects the token with the highest probability at each step. While it is straightforward and efficient, it can lead to repetitive patterns by ignoring the broader probability landscape. Setting the temperature parameter to zero during inference effectively implements greedy decoding.

On the other hand, sampling introduces randomness into the decoding process by randomly choosing each subsequent token based on the predicted probability distribution. This randomness results in increased output variability. Stochastic decoding is more adept at capturing a range of potential outputs, often yielding more creative responses. Higher temperature values lead to greater fluctuations, enhancing the model’s creativity.
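To make the contrast concrete, the following sketch invokes Cohere Command on Amazon Bedrock twice through the bedrock-runtime client, once with the temperature set to 0 (greedy decoding) and once with a higher temperature (sampling). The prompt, token limit, and temperature values are illustrative assumptions rather than settings taken from this walkthrough.

import json
import boto3

# Runtime client for model invocation (assumes AWS credentials and region are configured).
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def generate(prompt, temperature):
    """Invoke Cohere Command with the given sampling temperature and return the completion text."""
    body = json.dumps({
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": temperature,  # 0 approximates greedy decoding; higher values sample stochastically
    })
    response = bedrock_runtime.invoke_model(modelId="cohere.command-text-v14", body=body)
    return json.loads(response["body"].read())["generations"][0]["text"]

question = ("Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. "
            "How many clips did Natalia sell altogether in April and May?")
prompt = f"Q: {question}\nA: Let's think step by step."

greedy_completion = generate(prompt, temperature=0)     # deterministic output
sampled_completion = generate(prompt, temperature=0.7)  # varies from call to call

Running the sampled call several times produces different reasoning paths, which is exactly the behavior self-consistency exploits.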

Prompting Techniques: CoT and Self-Consistency

The reasoning capabilities of language models can be enhanced through prompt engineering. Specifically, CoT has been effective in eliciting reasoning for complex NLP tasks. One method for implementing a zero-shot CoT is by prompting the model with instructions to “think step by step.” Alternatively, the model can be exposed to examples of intermediate reasoning steps in a few-shot prompting fashion. Both scenarios typically utilize greedy decoding. CoT has demonstrated significant performance improvements over simple instruction prompting in arithmetic, commonsense, and symbolic reasoning tasks.
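As a small illustration of the two prompting styles described above, the snippet below assembles a zero-shot CoT prompt and a few-shot CoT prompt; the exemplar problem and its worked solution are invented for illustration and are not part of this article’s benchmark.

# Zero-shot CoT: append an instruction that elicits step-by-step reasoning.
question = ("Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. "
            "How many clips did Natalia sell altogether in April and May?\n")
zero_shot_cot_prompt = question + "A: Let's think step by step."

# Few-shot CoT: prepend one or more worked exemplars with explicit intermediate reasoning steps.
# The exemplar below is an illustrative placeholder.
cot_exemplar = ("Q: A baker made 24 muffins and sold 18 of them. How many muffins are left?\n"
                "A: The baker started with 24 muffins and sold 18, so 24 - 18 = 6 muffins are left. The answer is 6.\n\n")
few_shot_cot_prompt = cot_exemplar + question + "A:"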

Self-consistency prompting operates on the premise that introducing diversity into the reasoning process helps models converge on the correct answer. The technique employs stochastic decoding to achieve this in three steps (a minimal sketch of the sampling and aggregation loop follows the list):

  1. Prompt the language model with CoT examples to stimulate reasoning.
  2. Replace greedy decoding with a sampling strategy to generate a diverse set of reasoning pathways.
  3. Aggregate the responses to identify the most consistent answer among the outputs.
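The following is a minimal sketch of that sample-and-aggregate loop, written under the assumption of a generate_fn callable (such as the invocation helper sketched earlier) that returns one completion per call; the answer-extraction heuristic and the default parameters are illustrative.

import re
from collections import Counter

def extract_answer(completion):
    """Illustrative heuristic: take the last number mentioned in the completion as the predicted answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None

def self_consistency(prompt, generate_fn, num_paths=30, temperature=0.7):
    """Sample several reasoning paths with stochastic decoding and return the most consistent answer."""
    answers = []
    for _ in range(num_paths):
        completion = generate_fn(prompt, temperature)  # one stochastically decoded reasoning path
        answer = extract_answer(completion)
        if answer is not None:
            answers.append(answer)
    # Step 3: aggregate by majority vote over the sampled answers.
    return Counter(answers).most_common(1)[0][0] if answers else None

Marginalizing over reasoning paths this way trades additional inference calls for accuracy, which is why the cost of the method grows with the number of paths.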

Self-consistency has been shown to surpass CoT prompting on popular arithmetic and commonsense reasoning benchmarks. However, a notable limitation of this approach is its higher computational cost.

This article demonstrates how self-consistency prompting improves generative language models’ performance on two NLP reasoning tasks: arithmetic problem-solving and multiple-choice domain-specific question answering. We illustrate the method using batch inference on Amazon Bedrock (a sketch of a batch input record follows this list):

  • We utilize the Amazon Bedrock Python SDK in JupyterLab on an Amazon SageMaker notebook instance.
  • For arithmetic reasoning, we prompt Cohere Command with the GSM8K dataset, which consists of grade school math problems.
  • For multiple-choice reasoning, we prompt AI21 Labs Jurassic-2 Mid with a small selection of questions from the AWS Certified Solutions Architect – Associate exam.
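For orientation, here is a minimal sketch of how a batch input file for such a run could be assembled, under the assumption that Amazon Bedrock batch inference consumes JSONL records with recordId and modelInput fields; the question list, prompt template, file name, and inference parameters are placeholders.

import json

# Build a JSONL input file for batch inference; each line pairs a recordId with a
# model-specific request body in modelInput. Values below are illustrative placeholders.
questions = [
    "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. "
    "How many clips did Natalia sell altogether in April and May?",
]

with open("batch_input.jsonl", "w") as f:
    for i, question in enumerate(questions):
        record = {
            "recordId": f"RECORD{i:05d}",
            "modelInput": {
                "prompt": f"Q: {question}\nA: Let's think step by step.",
                "max_tokens": 512,
                "temperature": 0.7,  # sampling temperature for diverse reasoning paths
            },
        }
        f.write(json.dumps(record) + "\n")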

Prerequisites

This guide assumes the following prerequisites:

  • An AWS account with an ml.t3.medium notebook instance hosted in SageMaker.
  • An AWS Identity and Access Management (IAM) SageMaker execution role with the AmazonBedrockFullAccess managed policy and iam:PassRole permission attached, used by the Jupyter environment running on the SageMaker notebook instance.
  • An IAM BedrockBatchInferenceRole role for batch inference with Amazon Bedrock that includes Amazon Simple Storage Service (Amazon S3) access as well as an sts:AssumeRole trust policy (see the sketch after this list). For additional details, refer to the Amazon Bedrock documentation on batch inference.
  • Access to models hosted on Amazon Bedrock. You can manage model access through the Amazon Bedrock console and select from the available options. For this demo, we utilize Cohere Command and AI21 Labs Jurassic-2 Mid.

The projected cost for executing the code presented in this article is approximately $100, assuming self-consistency prompting is run once with 30 reasoning pathways using a single temperature-based sampling value.

Dataset for Evaluating Arithmetic Reasoning Capabilities

The GSM8K dataset consists of human-created grade school math problems characterized by high linguistic diversity. Each problem typically requires 2–8 steps to solve, involving a sequence of basic arithmetic operations. The dataset is frequently used to benchmark the multi-step arithmetic reasoning capabilities of generative language models. One example from the dataset is:

{"question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.n#### 72"}

Setting Up Batch Inference with Amazon Bedrock

Batch inference enables the execution of multiple inference calls to Amazon Bedrock asynchronously, enhancing the performance of model inference on extensive datasets. Currently, the service is in preview and only accessible through the API. To access batch inference APIs via custom SDKs, refer to the relevant documentation.
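As an illustration of what a batch job submission can look like with the Python SDK, the sketch below uses the create_model_invocation_job and get_model_invocation_job calls together with the service role from the prerequisites; the job name, account ID, bucket, and S3 prefixes are placeholders.

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Submit a batch inference job over a JSONL file of prompts stored in S3.
# The job name, account ID, and bucket are illustrative placeholders.
response = bedrock.create_model_invocation_job(
    jobName="self-consistency-gsm8k",
    modelId="cohere.command-text-v14",
    roleArn="arn:aws:iam::111122223333:role/BedrockBatchInferenceRole",
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://my-bedrock-batch-bucket/input/batch_input.jsonl"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://my-bedrock-batch-bucket/output/"}
    },
)
job_arn = response["jobArn"]

# Check the job status; outputs land in the configured S3 location when the job completes.
status = bedrock.get_model_invocation_job(jobIdentifier=job_arn)["status"]
print(status)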

Once you have downloaded and extracted the Python SDK in a SageMaker notebook instance, you can install it by executing the following code in a Jupyter notebook cell:
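The exact contents of the SDK archive are not shown here, so the cell below is a sketch that assumes the extracted package provides botocore and boto3 wheel files in a local bedrock-python-sdk directory; adjust the paths to match your download.

# Install the extracted preview SDK wheels (directory and file names are assumptions).
!pip install --quiet ./bedrock-python-sdk/botocore-*.whl
!pip install --quiet ./bedrock-python-sdk/boto3-*.whl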


